Background

My future career is going to be in the field of Data Science. I have recently been reading a book named “The New Geography of Jobs” - Enrico Moretti. Where he discusses a large number of geographical aspects that control a number of factors. One of the thoughts he shares is that of how the more educated people move to where the money is in the same way geese migrate south for the winter. I wanted to look at the cost of living per states and the number of factors that could affect the Data Science salaries.

library(tidyverse)
library(ggplot2)
library(readr)
library(plotly)
library(gapminder)
library(DT)
library(pander)
library(car)
library(dplyr)
# put in our state cpi data
cpi <- read_csv("../Data/data.csv")
# data science
salaries <- read.csv("../Data/state_data_sci.csv")

Hide Data

Show CPI Data

This data was taken and downloaded from a government website. I believe it was the Beareau of Labor Statistics

datatable(cpi, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")

Show Job Data

This Data was taken from Ziprecruiter.com. It should be noted that the salary information does not cover every one of the 50 states and there are some missing values.

datatable(salaries, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")

Analysis

us_states <- map_data("state")
p <- ggplot(data = us_states,
            mapping = aes(x = long, y = lat,
                          group = group, fill = region))

salaries <- salaries %>%
  mutate(State = tolower(State))

# Info we want for future
cpi1 <- cpi %>%
  mutate(state = tolower(state))

# temp df for us_state join
cpi_state <- cpi1 

us_states <- us_states %>%
  left_join(cpi_state, by=c("region"="state")) %>%
  left_join(salaries, by=c("region" = "State"))

us_states <- us_states %>%
  mutate(indexed_salary = (Annual.Salary/costIndex))

Population

Geography

CPI_gis <- us_states %>%
  ggplot(aes(x=long,y=lat,group=group, fill=pop2023)) +
  geom_polygon(color = "gray90", size = 0.1) +
  coord_map(projection = "albers", lat0 = 45, lat1 = 55) +
  scale_fill_continuous(type = "viridis")+
  theme(legend.position="bottom",
        axis.line=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank(),
        panel.border=element_blank(),
        panel.grid=element_blank())

cpi_gisplot <- ggplotly(CPI_gis)

cpi_gisplot 

The state map above displays the lower 48 states color coded according to their population. We can see that state with the largest population is currently California being followed by Texas, New York, and Florida.

Linear Regression

This analysis attempts to model the relationship between Annual Salary and Population in 2023. Specifically,

\[ \underbrace{Y_i}_\text{Annual Salary} = \beta_0 + \beta_1 \underbrace{X_i}_\text{Population in 2023} + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \sigma^2) \]

The hypotheses for our study concern the slope of the regression model, \(\beta_1\). If the slope is zero, then there is not a meaningful relationship between the Annual Salary and Population in 2023.

\[ H_0: \beta_1 = 0 \] \[ H_a: \beta_1 \neq 0 \] \[ \alpha = 0.05 \]

collected <- cpi1  %>%
  left_join(salaries, by=c("state" = "State")) %>%
  mutate(indexed_salary = (Annual.Salary/costIndex))

collected1 <- collected %>%
  filter(pop2023 < 30000000)

### linear regression
plot(Annual.Salary ~ pop2023, data= collected1, col= "firebrick3",pch=16, main="Population Effect On Salary")

salary.lm <- lm(Annual.Salary ~ pop2023, data= collected1)
coefvec <- coef(salary.lm)

abline(salary.lm, lwd=2, col="blue3")

abline(v=seq(40,120,20), h=seq(40,120,20), lty=2, col=rgb(.6,.6,.6,.2))

The equation of the estimated regression equation from the scatterplot above is given by

\[ \underbrace{\hat{Y}_i}_\text{Annual Salary} = 110632 - 0.0003156 \underbrace{X_i}_\text{Population} \]

The estimated slope of this regression is 0.0003156, and the p-value is \((p = 0.3128 )\) showing insignificant evidence that the slope is different than zero. Follows is a complete table showing the summary of the full regression results.

pander(summary(salary.lm))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 110632 1978 55.93 2.822e-34
pop2023 0.0003156 0.0003079 1.025 0.3128
Fitting linear model: Annual.Salary ~ pop2023
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
35 7740 0.03085 0.001483

Interpretation

The slop is not significantly different from zero showing that there is no relationship between the Salary of Data Scientists and the population of a state. Though this may be true for the over all state, it may be a different story at the more local level as in cities or even possibly counties.

Appropriateness of the above Regression

You can see the normality of the residuals shown in the Normal Q-Q plot.On top of that by the residuals versus fitted values plot, there are no real concerns with linearity and the vertical variation in the residuals seems constant enough for all fitted-values to assume constant variance of the errors. There are some odd patterns visible in this plot which are due to the integer values of the data, but this is okay. The residuals vs. order plot shows that error terms shows no patterns.

# Note that this r-chunk uses ```{r, fig.height=3}
# to reduce the size of the plots. This makes the plots
# look a little nicer and also emphasizes that they are not
# as important as the main graphic (scatterplot) of the analysis.

par(mfrow=c(1,3))
plot(salary.lm, which=1:2)
plot(salary.lm$residuals, main="Residuals vs Order", xlab="",
     ylab="Residuals")

Quality of Living

Geography

index <- us_states %>%
  ggplot(aes(x=long,y=lat,group=group, fill=indexed_salary)) +
  geom_polygon(color = "gray90", size = 0.1) +
  coord_map(projection = "albers", lat0 = 45, lat1 = 55) +
  scale_fill_continuous(type = "viridis")+
  theme(legend.position="bottom",
        axis.line=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank(),
        panel.border=element_blank(),
        panel.grid=element_blank())

index <- ggplotly(index)

index 

In the geospatial graph above we can see the index_salary for for multiple states geographically. The Indexed Salary shows the ratio of Annual Salary over the Cost Index for the coinciding state. I should be noted that the grey areas are that of states that having missing values for Data Science Salaries so as the ratio could not be calculated.

What State has the Best Quality of Living for Data Scientist?

collected <- cpi1  %>%
  left_join(salaries, by=c("state" = "State")) %>%
  mutate(indexed_salary = (Annual.Salary/costIndex))

temp_df <- collected %>%
  filter(indexed_salary > 1120)

temp_df$state <- factor(temp_df$state, levels = unique(temp_df$state)[order(temp_df$indexed_salary, decreasing = TRUE)])


fig <- plot_ly(temp_df)
fig <- fig %>% add_bars(x= ~state, y= ~indexed_salary, color= ~state)
fig <- fig %>% layout(title="Data Scientist Quality of Living", xaxis= list(title= "State"), yaxis= list(title="Quality of Living"), legend = list(title= list(text = "State")))
fig

This bar chart essentially shows us the the top 10 states for living on a data scientists salary based off of the average annual salary and the CPI index information to create an index for the quality of life for data scientist. We can see that the top three states for this ratio are: Tennessee, Idaho, and West Virginia. I am from Iowa and its nice to see that it at least made the top 10 list.

Conclusions

From our analysis there were a number of different things we learned of those we learned that the Population of a State has no affect on the Average Annual Salary of a Data Scientist. We also learned that Tennessee and Idaho are two states that have the best bang for your buck as a Data Scientist because the index_salary ratio was the largest in such states.